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A comparative classification scheme provides a good basis for several approaches to understand 
proteins, including prediction of relations between their structure and biological function. But it 
remains a challenge to combine a classification scheme that describes a protein starting from its 
well organized secondary structures and often involves direct human involvement, with an atomary 
level Physics based approach where a protein is fundamentally nothing more than an ensemble of 
mutually interacting carbon, hydrogen, oxygen and nitrogen atoms. In order to bridge these two 
complementary approaches to proteins, conceptually novel tools need to be introduced. Here we 
explain how the geometrical shape of entire folded proteins can be described analytically in terms of a 
single explicit elementary function that is familiar from nonlinear physical systems where it is known 
as the kink-soliton. Our approach enables the conversion of hierarchical structural information into 
a quantitative form that allows for a folded protein to be characterized in terms of a small number of 
global parameters that are in principle computable from atomary level considerations. As an example 
we describe in detail how the native fold of the myoglobin 1M6C emerges from a combination of 
kink-solitons with a very high atomary level accuracy. We also verify that our approach describes 
longer loops and loops connecting a-helices with /3-strands, with same overall accuracy. 



Comparative protein classification schemes such as CATH pQ and SCOP [5] are among the most valuable and 
widely employed tools in bioinformatics based approaches to protein structure. These schemes classify folded proteins 
in terms of their geometric shape, starting from prevalent secondary structures such as a-helices and /3-strands. But 
at the moment the final stages of the classification usually involve manual curation, and consequently these schemes 
are best suited for qualitative analysis of folded proteins. 

The goal of the present article is to develop novel tools that we propose can eventually provide a firm quantitative 
basis for the existing protein classification schemes. Ultimately we hope to close gaps between bioinformatics based 
protein structure classification and physics based atomary level approaches to protein folding, to comprehensively 
address wide range of issues such as protein structure prediction and relations between shape, function and dynamics. 
In this way we hope to open doors to new ways to perform evolutionary, energetic and modelling studies. 

Our approach is based on the recent observation [3] , |4] that the geometric shape of helix-loop- helix motifs can be 
captured by a single elementary function that is familiar from the physics of nonlinear systems where it describes 
the kink-soliton. This function involves only a relatively small set of global parameters but still characterizes an 
entire super-secondary structure involving two (a-)helices and/or (/3-)strands in addition of the loop that connects 
them. In [3] only individual supersecondary structures in relatively simple proteins and with quite short loops were 
considered. The approach proposed there did not work very well for entire protein chains, involving several helices 
and loops, it was essentially limited to a relatively short single loop with adjoining helices. The purpose of the present 
article is to show that the method can be developed to describe an entire protein and not just its helix-loop-helix 
segments. The protein can also be quite complex, it can involve several loops, both short and long and including 
those that connect a helices with /3 strands. Furthermore, the original Ansatz can be even simplified without affecting 
its accuracy. Remarkably we observe no loss of accuracy even when the length and complexity of the protein chain 
increases. Indeed, there does not appear to be any limitations whatsoever that have to be imposed on the complexity 
of the protein, for our approach to remain practical. 

Our motivation derives from an investigation of nonlinearities that are generic in the force fields employed in 
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classical molecular dynamics, a technique that is widely used in various theoretical studies of the structure, dynamics 
and thermodynamical properties of proteins, and in determining their folding patterns in x-ray crystallography and 
NMR experiments [5]. A classical molecular dynamics approach like AMBER [5] and GROMACS [7] describes the 
evolution of a folding protein in terms of Newton's law that determines the time dependence of the atomary spatial 
coordinates X(i) = {xj(i)} 

mMt) = -Vit^(X) (1) 

Here i = 1, ...,N catalogue the individual atoms both in the protein molecule and its environment, and U(X.) is an 
empirically constructed potential energy that governs the relevant mutual interactions between all atoms involved. 
Gcncrically the potential energy is written as the sum of two terms [H] 

U(X) = J2 ^covalcnt(X) + £ U reBt {X) (2) 

The first term describes the covalent two-, three-, and four-body interactions between all covalently bonded atoms. 
The second term describes the non-covalent interactions between all atoms. For example, in the widely used harmonic 
approximation the two-body contribution to potential energy that describes the vibrational motion of all pairs of 
covalently bonded atoms acquires the familiar form 



u£L= E fcy(l*i-*il-n>ii) 2 (3) 

bond 



where ro%j are the equilibrium distances between the pairs of covalently bonded atoms i and j, and k-ij are the ensuing 
spring constants. 

But there are also nonlinear corrections to the potential energy such as (|3|, albeit in practice they may be difficult 
to account for in a systematic manner. The study of these nonlinearities forms a basis of the present work. 

We start with a Gedanken experiment where we scrutinize a highly simplified version of an improvement to the 
harmonic approximation ([3]), with only a single (relative) coordinate on a line x so that Newton's equation is mere 



where the potential has the form 



dV 
dx 



V {x ) = \ m . {x -af^\ K{x + b f. {x -af 



That is we account for nonlinear deviations from the harmonic approximation by promoting the spring constant to a x- 
dependent quantity. The equilibrium position x = a of the harmonic approximation is recovered when |x| ~ \a\ << \b\, 

V(x) « l Kh \x-a) 2 -(l + 0{^)) 
but here we retain the full potential. We introduce 

c= -\{b + a) 



and define 



y = x-^(a-b) 



to arrive at the familiar "A0 (kink) equation of motion 



with the explicit dark soliton solution 



y = — y(y - c 

m 



y(t) = c-tanhfcW— (t-t )] 
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x(t) = y{t)+ l -{a-b) 



6. e "V^(*-*o) _ a . e -°y/&{*-*°) 

(4) 



cosh[ C y^(t-i )] 

This is the hallmark dark soliton (kink) configuration that interpolates between the two uniform ground states at 
x = a and x = —b when t — > ±00. The parameters a, b, to and the combination c are the canonical ones that 

characterize the asymptotic values of x(t) i.e. minima of the potential, and the size and location of the soliton. It is 
also noteworthy that for finite t the soliton Q describes a configuration with an energy above the uniform ground 
state x = a (or x = b) but that nevertheless can not decay into x = a (or x = b) through any kind of continuous finite 
energy transformation: A soliton configuration such as Q can not be obtained from any approach that only accounts 
for perturbations that describe small localized fluctuations around the uniform background ground state. 

We argue that our example is not just an academic exercise but can be developed into a systematic tool to 
quantitatively characterize the geometrical shape of super-secondary structures such as helix-loop-helix motifs. In 
fact, we propose that the very same function Q with t a length parameter that measures distance along a static 
protein backbone, together with its asymmetric generalization of the form 

5.gCi(t— to) _ a . g-c 2 (t-to) 
X ® = e ci(t-t ) + e -c 2 (t-t Q ) ( 5 ) 

which becomes handy e.g. when we consider loops connecting an a helix with a j3 strand, can describe the geometry 
of native folds of proteins in Protein Data Bank (PDB) [5]. Besides the four canonical soliton parameters that we 
have specified, we need to introduce only two additional independent global parameters to characterize a given super- 
secondary structure such as a helix-loop-helix motif and even an entire folded protein, with an atomary level accuracy 
that matches the resolution in experimental data. 

As an explicit example we have chosen myoglobin, a widely studied oxygen-binding protein of both historical and 
biological interest that has been discussed extensively in most biochemistry texts. Specifically, we have selected the 
153 amino acid myoglobin with Protein Data Bank code 1M6C whose all-atom structure is known to an all-atom 
resolution of 1.90 A in root-mean-square distance (RMSD) from x-ray diffraction measurements [8]. We analyze it in 
detail, to show that its entire fold can be encoded into the global parameters of the elementary function Q , ^ with 
a RMSD accuracy of 1.27 A for the central C a carbons. Moreover, as the myoglobin only involves super-secondary 
structures with a and 3/10 helices that are connected by relatively short loops, we also verify that our approach can 
be extended to longer loops, and loops that connect a helices with j3 strands. For this we analyze an a helix - loop - j3 
strand segment in the HIV-1 reverse transcriptase protein with PDB code 3DLK. The loop is now clearly longer than 
those in myoglobin, nevertheless we find that it can be described with comparable RMSD accuracy by the profile ([5| . 



II. MYOGLOBIN AS MULTISOLITON 



In order to describe the PDB fold of a relatively complex protein such as the 153 amino acid 1M6C in terms of the 
single elementary function Q , we start by computing the values of its discrete Frenet curvature «j and Frenet torsion 
r, from the PDB data. The relevant equations are as follows (for detailed derivation, see [9]): From PDB we get the 
three dimensional coordinates of the central a-carbons (i = 1, ...,N). With these we compute the tangent vector 
tj and the binormal vector using 



|r <+1 - ri | 62 D * Iti-i-til 



(6) 



and the normal vector is given as 

n; = hi x t< 

These three vectors are subject to the discrete Frenet equation 



= exp{- Kl • T 2 } ■ exp{-r, • T 3 } b (7) 
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Here T 2 and T 3 are two of the standard adjoint generators of three dimensional rotations, explicitly in terms of the 
permutation tensor we have 

/rpiyk _ ( ijk 

From ([6]), |7]) we can compute m and r» as the bond angles and the torsion angles in terms of the PDB data for rj. 
Alternatively, if we know these angles we can compute the coordinates up to global rotations and translations. The 
common convention is to select the range of these angles so that Ki is non-negative. In the continuum limit where ([7]) 
becomes the standard Frenet equation for a continuous curve, Ki — > k(x) then corresponds to local curvature which 
is by convention defined to be non-negative. 

For 1M6C we take i to take values i — 3, 149 = N; We leave out three (four) sites at both end as we need three 
sites to initiate the computation of the Kj and Tj along the polygon, and the end points are anyway presumed to be 
subject to relatively large conformational fluctuations. In Figure 1 (top) we display the Ki and n along the myoglobin 
backbone, using the standard differential geometric convention that Ki is non-negative. This Figure displays the 
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FIG. 1: The values of Ki and Ti for 1M6C, obtained from PDB. In the top picture we present these values using the standard 
convention that Hi is non-negative. In the bottom picture we have resolved the soliton structure using Z2 gauge structure of the 
Frenet equation, by allowing Ki to change sign whenever there is an inflection point. This identifies the soliton structures (loops) 
along the backbone. The indexing refers to the position of amino acids along the backbone, counting from the iV-terminus. 

geometric structure of the 1M6C backbone fold: At the location of the a and 3/10 helices both Ki and r$ have pretty 
constant values, as expected for helical geometry. The difference between these two types of helices is visible in the 
Figure, in (slight) difference in the corresponding constant values of Ki and r^. At the location of loops, we note small 
variations in Ki while the values of rj are fluctuating quite wildly. In order to identify the locations of the inflection 
points that determine the center of the loops i.e. solitons, we follow 3] and subject the data in Figure 1 (top) to local 
Z2 gauge transformations in the loop regions; these transformations leave the solution of ^ intact and thus have no 
effect on the geometry of the space polygon. The result is shown in Figure 1 (bottom); the two data point sets in 
the top and bottom of Figure 1 describe the same space polygon. But from the bottom Figure 1 we conclude that in 
terms of Kj we may interpret the backbone as a space polygon with eleven helices that are separated by ten inflection 
points (soliton centers), these are the points where Kj changes its sign. Consequently we divide the backbone into ten 
super-secondary structures, each consisting of a helix-loop-helix soliton motif. These motifs are identified in Table I. 

We note that PDB lists 1M6C as an eight-helix protein. But Figure 1 reveals that there is an advantage to interpret 
it in terms of a curve with ten inflection points, so that for a match with the functional form Q we need to introduce 
ten overlapping segments. Furthermore, an examination of the PDB data reveals that there are four different types 
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of loops i.e. solitons: Those that connect two a helices, those that connect an a-helix with a 3/10-helix or vice versa, 
and finally those that connect two 3/10-helices. 

In order to describe a motif consisting of a loop together with the two similar types of helices that it connects, 
we use the Ansatz Q with the symmetric (a = b) relation for the two parameters in Q. But for motifs where a 
loop connects two different types of helices (a with 3/10) we allow these parameters to be independent, reflecting the 
difference in the helices. Thus our Ansatz for the entire backbone is the modification ^ of the Ansatz introduced in 
[3]: For the bond angles we introduce the dark soliton profile 

m , . pCr(i-S r ) _ m . p -C r (i-S r ) 

Ki = (-l) r+1 rl — ir r2 * (8) 

v ; 2 cosh[c,.(i - s r )} y ' 

and we obtain the torsion angles from this soliton profile using the relation 

1 b r 



Ti 21 + d T K\ 



(9) 



Here r = 1,...,10 labels the ten helix-loop-helix motifs of 1M6C and (c r , m r \, m r 2, s r ) are the canonical parameters 
for a kink-soliton, and (b r , d r ) are additional parameters needed to express Ti in terms of Ki. 

Note that in ^ we have simplified the Ansatz of [3] for torsion angles. Now there is no contribution from Ki in 
the numerator, thus there is one less parameter. The reason for this simplification is, that ([8]), ^ is not an ad hoc 
Ansatz but can be firmly justified in terms of the equations of motion in an underlying Hamiltonian model which is 
based on the Abelian Higgs Model [KT . The additional term used in [3J version of ^ does not have any natural 
interpretation in terms of the Abelian Higgs Model and thus there is no geometric reason for including it. Here we 
confirm that it can be safely removed, with no adverse effect in accuracy. In fact, despite the additional increased 
complexity in protein structure that we consider, the accuracy reported here is even better than that in [3J. 

We also emphasize that the parameters are all global parameters that are specific to a given helix-loop-helix motif 
and as such have no direct reference to the amino acids even though they should eventually become computable from 
an atomary level set-up. At the level of the Abelian Higgs Model [TO] each of the parameters has a well established 
interpretation in terms of charge, mass, self-coupling etc. Here, these parameters characterize the global attributes 
such as the location and the size of the soliton-loop in terms of Ki and Ti, the nature of the adjacent helices, and the 
chirality of the protein. Moreover, since all the solitons except 2 and 5 connect similar helices, whenever r ^ 2, 5 we 
can set m rl = m r 2 while for solitons number 2 and 5 that connect two different kind of helices we choose m rl ^ m r 2 . 
We also emphasize that the Ansatz involves only the single function Q, in its discrete form. This means that 
for each helix-loop-helix superstructure we only need to determine the five (or six in case the helices are different) 
global parameters. In our computations we determined these parameters using a standard Metropolis algorithm in 
combination with simulated annealing, to minimize the RMSD between the polygon described by our Ansatz and the 
C a backbone of the 1M6C protein in PDB. The actual algorithm is a very simple and straightforward application of 
standard Monte Carlo minimization that runs with PC. 

In Table I we display the parameters that yield the smallest RMSD value (RMSD = 1.27 A) that we have obtained 
when we have subjected the entire 1M6C backbone to a RMSD minimization. 

We also give the lowest RMSD values that we have obtained when we have separately optimized the parameters for 
each of the individual soliton. For the solitons 1,2,3,4,5,9 and 10 we find very low RMSD values, clearly smaller 
than the radius (~ 0.7 A) of an individual carbon atom. However, the number of sites that appear in the solitons 
3, 4, 5 are also quite small. This is due to the proximity of the ensuing solitons along the backbone. For solitons 
number 6, 7, 8 the RMSD values are somewhat larger, but the solitons are also longer. However, even in these cases 
our RMSD values are clearly below the overall 1.90 A resolution in the underlying PDB data. In Figure 2 we display 
the C a backbone of 1M6C, together with its reconstruction in terms of the Ansatz Q, 

We remind that even though our description involves five (six) free parameters for each helix-loop-helix motif, there 
is only one single function, the kink-soliton Q . These parameters can in principle be determined from a first principle 
atomary level approach to protein folding, even though in practice this is not yet possible. For this we recall [3] that as 
such, (8), (|9| is an approximate solution to a definite discrete nonlinear equation of motion in a Hamiltonian system 
that provides an effective description of a more fundamental atomary level model. 



III. LONG LOOPS 



The previous interpretation and construction of the myoglobin 1M6C backbone clearly demonstrates that the 
method proposed in [3] can be extended from helix-loop-helix super-secondary structures to entire proteins, even 
for relatively long proteins and with several helix-loop-helix combinations and both a and 3/10 helices. However, 
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TABLE I: The parameters for solitons along the 1M6C C a -backbone, with indexing starting from the N terminus. 



soliton 


1 


2 


3 


4 


5 


sites 


3-24 


22-42 


37-46 


43-50 


47-58 


type 


a-a 


a-3/10 


3/10-3/10 


3/10-3/10 


3/10-a 


b r 


78.398 


79.1807 


68.7412 


39.727 


55.9241 


C r 


1.5708 


2.5280 


2.5290 


2.5550 


3.1391 


d r 


-0.2905 


-0.1268 


-0.2347 


-0.2464 


-0.2998 


m r i 


1.53668 


1.4979 


1.56503 


1.55474 


1.5668 


m r 2 


- 


1.5113 


- 


- 


1.5651 


S r 


20.5981 


36.488 


43.3982 


45.657 


51.733 


rmsd 


0.83 


0.49 


0.15 


0.56 


0.40 


soliton 


6 


7 


8 


9 


10 


sites 


52-80 


59-98 


81-119 


102-123 


120-150 


type 


a - a 


a-a 


a-a 


a-a 


a-a 


b r 


73.358 


92.551 


48.059 


114.599 


93.2733 


C r 


2.1488 


2.1874 


1.95991 


2.2796 


2.5496 


dr 


-0.3035 


-0.4649 


-0.3688 


-0.1887 


-0.1565 




1.52541 


1.52732 


1.48823 


1.55946 


1.54715 




57.8112 


80.7367 


98.2245 


118.8551 


124.404 


rmsd 


1.12 


1.46 


1.62 


0.60 


0.37 



The solitons have some overlap with their nearest neighbors, to enable us to combine them into a single multi-soliton profile. 
The type identifies whether the soliton consists of a loop that connects a-helices and (or) 3/10-helices. 



FIG. 2: The structure of the 1M6C protein (green) together with its reconstruction in terms of our Ansatz (purple). The 
RMSD distance between the two configurations is w 1.27 A. 



the question remains whether the quality of the method becomes adversely affected if the loop length increases, 
and whether the method also describes loops that connect an a-helix with a f3 strand. We address these issues by 
considering a protein loop with 12 C a -carbons connecting an cv-helix with a /3-strand. More specifically, we consider 
the sites 398-416 in the HIV-1 reverse transcriptase protein with PDB code 3DLK. In line with the construction of 
the solitons in the case of myoglobin, we describe the super-secondary structure with the following variant ([5| of the 
Ansatz Q, ([£]), 

mi . e ci(t-a ) _ m2 . e -C2(i-«o) 

K i = T- s T- n (10) 

gci(»— so) _|_ g— c 2 (i— S D ) v ' 

and we again obtain the torsion angles from this soliton profile using the relation 
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The asymmetric choice (mi, c±) vs. (7712, C2) reflects the difference between the a- helix and /3-strand, and we now start 
the indexing by choosing i = 1 for the site 398. With the choice of parameters in Table II we find that the Ansatz 

TABLE II: The parameters for describing the sites 398-416 along 3DLK. Indexing starts with i = 1 at site 398. 



mi 


Cl 


TO2 


C2 


so 


b 


d 


57.626008 


1.836469 


58.05348 


1.8462217 


10.43150 


6601165.9 


-0.000101 



describes the 3DLK segment with a RMSD accuracy of 1.13 A; Notice that due to the presence of exponentials, for 
high accuracy it is imperative to include sufficiently many decimal points in the parameters. In Figure 3 we display 
the original 3DLK segment, together with its soliton approximation. 
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FIG. 3: Sites 398-416 in 3DLK (green; PDB indexing) and their approximation (purple) by (10 1, (111 with parameter values 
given in Table II. The RMSD distance is ~ 1.13 A 



We conclude, that the present approach is suitable not only for long protein chains such as myoglobin, but it 
also describes long loops and loops that connect very different kind of helices such as a helices, 3/10 helices and /3 
strands. However, if the loop length increases substantially, we propose that a more accurate prescription is obtained 
by describing these loops as bound states of several short loops, each with the profile (10), (fTTl). This is consistent 



with the well known fact that short supersecondary structures are known to recur many times in PDB proteins. A 
detailed analysis of long loops as bound states of short loops (multi-soliton states) will be presented elsewhere. 



IV. CONCLUSION 



Using the myoglobin 1M6C as an example, we have demonstrated that the entire native fold of a long protein 
can be described with high accuracy as a combination of kink-solitons, in a manner that involves only one single 
elementary function. In this picture, each of the solitons describe a loop configuration that interpolates between two 
different helices. By inspecting a longer loop that connects an a-helix with a /3-strand we have verified, that the 
approach remains valid with no loss of accuracy as the loop size increases. However, for substantially longer loops, we 
expect that an interpretation in terms of a multi-soliton configuration becomes more accurate both mathematically 
and phcnomcnologically. The parameters that characterize a particular protein fold are all global, and specific to its 
supersecondary helix-loop-helix motifs. Consequently the determination of these parameters becomes synonymous 
to a quantitative classification of proteins. The presence of an underlying Hamiltonian interpretation at the level 
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of motifs also strongly suggests that our approach could eventually provide a bridge between comparative protein 
classification schemes such as CATH and SCOP, and the atomary level physics based approaches to protein folding 
and structure prediction, including folding pathways and various other dynamical issues that presently can not be 
easily addressed in qualitative protein classification schemes. This should open doors to new ways of performing 
evolutionary, energetic and modelling studies. 
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